Statistical Disclosure Control
Contact: Peter-Paul de Wolf, Statistics Netherlands, P.O. Box 24500, 2490 HA The Hague, The Netherlands. Phone: +31 70 337 5060
Last update: 10 Oct 2011
Methodology testing (WP 5)
Leading partner: IStat
Participating partners: IStat, UniMan, StBa, ONS

Objectives
This workpackage aims at verifying the effectiveness and applicability of the statistical disclosure control techniques proposed in WP1.1, WP1.2 and WP3, in particular those implemented in ARGUS.

Task TM1 (Responsible UniMan)
Objectives
This task builds on earlier work that attempted to match household and census records. That work yielded very valuable results, in particular highlighting the protection that arises from highly correlated categorical variables. We propose to extend it in the context of the General Household Survey, starting from two or three defined scenarios to define key variables and then assessing the availability of matching data across a range of European countries. Data sources for matching attempts will include occupational registers, electoral registers, GP lists and housing information. With appropriate collaboration this could be extended to use government or business files for matching. If the General Household Survey data eventually prove unavailable, another representative survey will be chosen.

Description of the work
Using the General Household Survey we will attempt to mimic an intruder seeking to establish identification. This work will lead to an assessment of the degree to which identification is impeded by the application of the disclosure control methods implemented in ARGUS. The work will need NSI resources for validating matches/identifications and close collaboration with an NSI over appropriate data sources to use. The result will give an estimate of the intruder's success given a high level of resource input (e.g. 1 year of research time, advanced computing, etc.). It will highlight the weakest points in protection and thus indicate the particularly risky variables or combinations of variables. It will also indicate the increased difficulty of identification after disclosure control methods have been applied. The work will also employ the new "Data Intrusion Simulation" method recently developed under funding from the UK Economic and Social Research Council and the US Bureau of the Census, which provides estimates of the probabilities of correct matching against a given target file.
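The matching attempt described for Task TM1 can be illustrated with a minimal sketch. It is not the Data Intrusion Simulation method and not anything implemented in ARGUS; it only shows the general idea of linking a survey file to an external register on a set of key variables and then validating the claimed matches. The function name `matching_experiment`, the field names (`person_id`, `age_group`, `sex`, `occupation`) and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def matching_experiment(survey, register, key_vars):
    """Link survey records to register records that agree on all key variables.

    Returns (unique_matches, correct_unique_matches). A match is "unique" when
    exactly one register record shares the survey record's key combination;
    it is "correct" when the true identifiers also agree (the validation step,
    which in practice would need access to the true identities).
    """
    # Index the register by the combination of key-variable values.
    index = defaultdict(list)
    for rec in register:
        index[tuple(rec[v] for v in key_vars)].append(rec)

    unique = correct = 0
    for rec in survey:
        candidates = index.get(tuple(rec[v] for v in key_vars), [])
        if len(candidates) == 1:  # the intruder would claim a match here
            unique += 1
            if candidates[0]["person_id"] == rec["person_id"]:
                correct += 1
    return unique, correct


# Hypothetical toy data: an external register and a survey sample.
key_vars = ["age_group", "sex", "occupation"]
register = [
    {"person_id": 1, "age_group": "30-39", "sex": "F", "occupation": "nurse"},
    {"person_id": 2, "age_group": "30-39", "sex": "M", "occupation": "teacher"},
    {"person_id": 3, "age_group": "30-39", "sex": "M", "occupation": "teacher"},
    {"person_id": 4, "age_group": "40-49", "sex": "F", "occupation": "farmer"},
]
survey = [
    {"person_id": 1, "age_group": "30-39", "sex": "F", "occupation": "nurse"},
    {"person_id": 2, "age_group": "30-39", "sex": "M", "occupation": "teacher"},
    # Person 6 is not in the register but shares a key with person 4.
    {"person_id": 6, "age_group": "40-49", "sex": "F", "occupation": "farmer"},
]

unique, correct = matching_experiment(survey, register, key_vars)
print(f"unique matches: {unique}, of which correct: {correct}")
```

In this toy case two survey records match uniquely, but only one of the two claimed matches is correct. The false match arises because a unique match in the register does not guarantee that the survey respondent is in the register at all, which is why validation of matches is needed.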
Milestones and expected results
Report on an evaluation of the availability of data sources which could be used for identification purposes, after 12 months.

Task TM2 (Responsible UniMan)
Objectives
Much of disclosure risk research focuses on the control side of the disclosure issue, asking: "What do we need to do in order to make this data safe?" However, this question is only one side of the problem that a data provider faces in controlling for risk. All risk control methods degrade the data to some extent and therefore reduce the ability of data users to conduct the analyses they need for their legitimate purposes.

Description of the work
To turn this complex issue into a tractable problem, the work will focus on datasets available from the 1991 UK census. This will enable the researcher to build on work conducted in preparation for the 2001 census, surveying the uses made of UK census microdata, as well as four years of work analysing disclosure risk with such data. The work will allow an empirical investigation of the feasibility of assessing the impact of disclosure control techniques on analytical power.

Milestones and expected results
The generation of a prototypical set of data analyses for 1991 UK census data (8 months).

Task TM3 (Responsible StBa)
Objectives
Testing the general applicability of the masking algorithm developed as one of the tasks of WP 1.1 by applying it to other business data.

Description of the work
In order to test the general applicability of the masking algorithm, tests will be performed with data sets different from those used during development. For example, the general masking procedure developed under WP 1.1 will be applied to a complex business panel survey (defining subsets, masking the subsets, evaluating analytical validity and sufficiency of the mask). The results of these tests will highlight the impact of the specifics of particular data sets on the use of a masking algorithm and allow some conclusions concerning the general option of disseminating these particular business data.

Milestones and expected results
Establish the scope for effective application of masking techniques for anonymising business data.

Task TM4 (Responsible StBa)
Objectives
The aim of this task is to compare the results of the masking to those of other techniques, especially sophisticated micro-aggregation. On the basis of these experiences a strategy for the dissemination of business data will be developed, which will probably mix several techniques.

Description of the work
Within this task, empirical comparisons of the different methods developed and proposed in WP 1.1 will be performed. To this end, at least two different subsets of the data will be masked applying both Sullivan’s approach and microaggregation techniques (cf. WP 1.1). The comparisons will consider three different aspects: technical applicability, effectiveness of the method (protection level), and analytical validity of the perturbed data. Results will be analysed in the light of theory.
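To make the term microaggregation concrete, here is a minimal sketch of its simplest form: univariate, fixed-size microaggregation, in which values are sorted, grouped into blocks of at least k records, and each value is replaced by its group mean. This is an assumption-laden illustration only. It is neither the "sophisticated" variant referred to in WP 1.1 nor Sullivan's approach, and the function name `microaggregate` and the `turnover` figures are hypothetical.

```python
def microaggregate(values, k=3):
    """Univariate fixed-size microaggregation.

    Sorts the values, forms consecutive groups of k records (the last group
    absorbs any remainder so that no group has fewer than k members), and
    replaces each value with the mean of its group. Output order matches
    the input order.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices by value
    masked = [0.0] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # too few records left for another full group
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
        start = end
    return masked


# Hypothetical turnover figures for seven enterprises.
turnover = [120.0, 15.0, 18.0, 900.0, 21.0, 110.0, 130.0]
masked = microaggregate(turnover, k=3)
print(masked)
```

Group means preserve the overall total, which is one simple aspect of analytical validity, while the outlying value of 900 is hidden inside a group of four. The example also shows the cost: the mean of that group is pulled far from the other three members, which is the kind of distortion the comparisons in this task would need to quantify.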
Based on the results of the comparisons, a strategy for disseminating cross-sectional business data will be proposed. It will probably mix various techniques, combining not only the perturbation techniques developed in WP 1.1 but also well-known non-perturbative techniques such as subsampling, eliminating highly endangered subpopulations, recoding, and global and local suppression.

Milestones and expected results
A strategy for anonymising cross-sectional business data, suggesting a mix of various perturbative and non-perturbative techniques.

Task TM5 (Responsible IStat)
Objectives
To test the effectiveness of the methods proposed in Task T1 of WP 1.1 by applying them to real data.

Description of the work
The proposed approach will be tested on a business survey: the Community Innovation Survey (CIS). The CIS involves a mixture of continuous and categorical variables and poses considerable confidentiality problems. Statistical analyses will be carried out in order to evaluate possible distortions resulting from the proposed methodologies. We will also assess how much disclosure protection the methods achieve by applying matching algorithms.

Milestones and expected results
We expect to define a strategy for the creation of a microdata file for research from business survey data.

Task TM6 (Responsible StBa)
Objectives
The task aims at evaluating the effectiveness and applicability of the methodology for tabular data protection proposed in WP 3. Various data sets from economic surveys will be used (e.g. business tax statistics, structural business survey, etc.).

Description of the work
Tools and methods for secondary cell suppression as proposed and implemented in τ-ARGUS as tasks of WP 3 and WP 4.2 will be applied to single and multiple tables from various economic surveys.
Information loss will be assessed by recording the number and the sum of values of secondary suppressions. The effectiveness of the facilities for controlling the selection of secondary suppressions will be tested, and strategies for the use of these facilities will be proposed. The computing resources required, which are expected to depend largely on the size and structural complexity of an application, will be recorded.

Milestones and expected results
Establish the scope and propose strategies for effective application of the various cell suppression tools and methods developed and implemented in τ-ARGUS as tasks of WP 3 and WP 4.1.
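The information-loss measure named in the TM6 description (number and summed value of secondary suppressions) can be sketched as follows. The cell representation, the status labels `"safe"`, `"primary"` and `"secondary"`, and the function name `suppression_loss` are assumptions for illustration; they do not reflect the output format of τ-ARGUS.

```python
def suppression_loss(cells):
    """Summarise information loss due to secondary cell suppression.

    `cells` is a list of dicts with a numeric 'value' and a 'status' that is
    one of 'safe', 'primary' or 'secondary' (hypothetical labels). Returns
    the number of secondary suppressions, their summed value, and that sum
    as a share of the total over all listed cells.
    """
    secondary = [c for c in cells if c["status"] == "secondary"]
    total = sum(c["value"] for c in cells)
    sum_secondary = sum(c["value"] for c in secondary)
    return {
        "count": len(secondary),
        "value": sum_secondary,
        "value_share": sum_secondary / total if total else 0.0,
    }


# Hypothetical interior cells of a small table after suppression.
cells = [
    {"value": 100, "status": "safe"},
    {"value": 5, "status": "primary"},
    {"value": 40, "status": "secondary"},
    {"value": 55, "status": "secondary"},
    {"value": 200, "status": "safe"},
]

loss = suppression_loss(cells)
print(loss)
```

Recording both the count and the summed value matters because two suppression patterns with the same number of secondary suppressions can hide very different amounts of the table total.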